Regular expression searching on compressed text

Author

  • Gonzalo Navarro
Abstract

We present a solution to the problem of regular expression searching on compressed text. The format we choose is the Ziv–Lempel family, specifically the LZ78 and LZW variants. Given a text of length $u$ compressed into length $n$, and a pattern of length $m$, we report all the $R$ occurrences of the pattern in the text in $O(2^m + mn + Rm \log m)$ worst case time. On average this drops to $O(m^2 + (n + Rm) \log m)$ or $O(m^2 + n + Ru/n)$ for most regular expressions. This is the first nontrivial result for this problem. The experimental results show that our compressed search algorithm needs half the time necessary for decompression plus searching, which is currently the only alternative. © 2003 Elsevier B.V. All rights reserved.
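To make the quantities $u$ and $n$ concrete, the following is a minimal Python sketch of a textbook LZ78 parse together with the "decompress, then search" baseline that the abstract names as the only alternative. It is not the paper's algorithm, which searches without decompressing. The function names (`lz78_compress`, `lz78_decompress`, `search_by_decompressing`) are hypothetical, the phrase list stands in for a real bit-level encoding, and `re.finditer` reports only non-overlapping matches.

```python
import re


def lz78_compress(text):
    """Parse text into LZ78 phrases (index of a previous phrase, next char)."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch  # keep extending the longest known phrase
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # Leftover text equals a known phrase; re-emit it as (prefix, last char).
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases


def lz78_decompress(phrases):
    """Rebuild the original text from the phrase list."""
    table = [""]  # table[i] is the text of phrase i
    out = []
    for ref, ch in phrases:
        phrase = table[ref] + ch
        table.append(phrase)
        out.append(phrase)
    return "".join(out)


def search_by_decompressing(phrases, pattern):
    """Baseline only: decompress fully, then run an ordinary regex search."""
    text = lz78_decompress(phrases)
    return [m.start() for m in re.finditer(pattern, text)]


phrases = lz78_compress("abababab")   # u = 8 characters
print(len(phrases))                   # n, counted here in phrases
print(search_by_decompressing(phrases, "ba"))
```

In this sketch $n$ is counted in phrases rather than in bits, which is enough to see that the baseline always pays for rebuilding all $u$ characters before any matching starts.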


Similar resources

Regular Expression Searching over Ziv-Lempel Compressed Text

We present a solution to the problem of regular expression searching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length $u$ compressed into length $n$, and a pattern of length $m$, we report all the $R$ occurrences of the pattern in the text in $O(2^m + mn + Rm \log m)$ worst case time. On average this drops to O(m^2 + (n + R) l...


A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text

We address in this paper the problem of string matching on Lempel-Ziv compressed text. The goal is to search for a pattern in a text without uncompressing it. This is a highly relevant issue, since it is essential to have compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts th...


Pattern Matching in DCA Coded Text

A new algorithm for finding all occurrences of a regular expression pattern in a text is presented. It works directly on text compressed with antidictionary-based compression, without decompressing it. The proposed algorithm runs in O(2 · ||AD|| + nC + r) worst case time, where m is the length of the pattern, AD is the antidictionary, nC is the length of the coded text and r is the n...


Threshold Approximate Matching in Grammar-Compressed Strings

A grammar-compressed (GC) string is a string generated by a context-free grammar. This compression model captures many practical applications, and includes LZ78 and LZW compression as a special case. We give an efficient algorithm for threshold approximate matching on a GC-text against a plain pattern. Our algorithm improves on existing algorithms whenever the pattern is sufficiently long. The ...


Survey of Global Regular Expression Print (GREP) Tools

The UNIX grep utility marked the birth of global regular expression print (GREP) tools. Searching for patterns in text is an important operation in a number of domains, including program comprehension and software maintenance, structured text databases, indexing file systems, and searching natural language texts. Such a wide range of uses inspired the development of variations of the original UN...



Journal title:
  • J. Discrete Algorithms

Volume 1  Issue 

Pages  -

Publication date 2003